How to Determine Package Significance to the Ecosystem

This was motivated by the post: https://medium.com/@nayafia/what-success-really-looks-like-in-open-source-2dd1facaf91c#.xbtww37yy

This notebook examines Python packages to estimate how significant each one is within the Python ecosystem, using how often it is imported as the measure. It assumes that the projects to be examined are already checked out under a single directory (PACKAGES_DIR below).


In [6]:
# Installed Package Imports
import os
import matplotlib
%matplotlib inline
from matplotlib import pylab
from collections import Counter
import pandas

# Custom Code Imports
import sys
sys.path.append('../')
import utils

In [24]:
PACKAGES_DIR = '../../source_packages'
packages = os.listdir(PACKAGES_DIR)
print(', '.join(packages))  # List what's being examined


.DS_Store, cylc, dash, forks, hualos, ipython, iris, jupyter_notebook, keras, letsencrypt, libgpuarray, memegenerator, memegenerator_orig, mkdocs, octohatrack, peewee, plotly.js, plotly.py, pybluez, techradar, tensorflow, Theano, warp-ctc, word2vec, xgboost

In [12]:
# Here we walk the directory, and build a Pandas dataframe of results

relevant_files = utils.yield_relevant(PACKAGES_DIR)
package_generator = utils.yield_all_packages(relevant_files)
df = utils.package_stats(package_generator)
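
The helpers in utils are not shown in this notebook, so the following is only a rough sketch of what a walk-and-count step of this kind could look like, not the actual implementation. The function names iter_python_files, imported_modules and count_imports are hypothetical. The sketch assumes that only .py files are relevant and that imports are extracted with the standard-library ast module, which ignores comments and string literals.

```python
import ast
import os
from collections import Counter


def iter_python_files(root):
    """Yield the path of every .py file under root (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.py'):
                yield os.path.join(dirpath, name)


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Unparseable file (e.g. Python 2 only syntax): skip it rather than guess.
        return []
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b, c" contributes "a" and "c"
            names.extend(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a"; relative imports are skipped
            names.append(node.module.split('.')[0])
    return names


def count_imports(root):
    """Count import statements per top-level module across all .py files under root."""
    counts = Counter()
    for path in iter_python_files(root):
        with open(path, encoding='utf-8', errors='replace') as handle:
            counts.update(imported_modules(handle.read()))
    return counts
```

Assuming pandas is available, the result could be turned into a dataframe shaped like the one used in this notebook with pandas.DataFrame(sorted(counts.items()), columns=['package', 'count']).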

In [29]:
# Some spurious tokens (e.g. '#' and '%') are being counted as packages,
# and a few of them have very large counts
# TODO: Debug the import parsing properly
# For now: ignore any entry with a count of 40 or more
df_small = df[df['count'] < 40]
print(df_small.describe())
print()
print(df.head())


             count
count  6718.000000
mean      3.441500
std       5.039497
min       1.000000
25%       1.000000
50%       2.000000
75%       4.000000
max      39.000000

    package  count
1         #    633
2  #786)"""      1
3      #for      1
4      #py2      1
5         %     30
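
Entries such as '#' and '%' cannot be module names, so one possible stopgap until the parsing is debugged is to drop anything that is not a valid dotted Python name. The sketch below is an assumption about how that could be done, not part of utils. The helper is_plausible_module is hypothetical, and the candidate list reuses the rows printed above plus two ordinary names for contrast.

```python
import keyword


def is_plausible_module(name):
    """True if every dot-separated part of name is a valid, non-keyword identifier."""
    parts = name.split('.')
    return all(part.isidentifier() and not keyword.iskeyword(part) for part in parts)


# The first rows of df.head() above, plus two ordinary module names for contrast.
candidates = ['#', '#786)"""', '#for', '#py2', '%', 'os.path', 'numpy']
kept = [name for name in candidates if is_plausible_module(name)]
print(kept)  # ['os.path', 'numpy']
```

On the dataframe itself, the same check could be applied with df[df['package'].map(is_plausible_module)]. This only hides the symptom: the large count for '#' suggests that comment lines are being read as imports, which would still need fixing at the source.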

In [21]:
plot = df_small['count'].hist()
plot.set_xlabel("Number of times imported")
plot.set_ylabel("Number of packages in bin")


Out[21]:
<matplotlib.text.Text at 0x106d38668>
